For this project we chose movie scripts as the basis for the text analysis. 641 movie scripts were downloaded from the largest source of movie scripts on the web, www.imsdb.com. Details about how the movie scripts were downloaded can be found in Part 2.
First we read in the data from our IMDb dataset to get the necessary information about every English-language movie from 1986 to 2016.
The IMDb dataset contains 4377 movies in total, considerably more than the number of movie scripts we have. The text analysis therefore only covers the movies for which we have scripts, excluding the rest.
# Read in IMDb data
import csv

movies = {}
with open('imdb_dataset_v7.2_6_actors_complete.tsv') as csvfile:
    reader = csv.DictReader(csvfile, delimiter="\t")
    for entry in reader:
        # Keep every column except the title, which becomes the key
        info = {field: entry[field] for field in
                ("director", "rating", "votes", "year", "genre",
                 "gross", "budget", "run-time", "plot")}
        # Name, rank and sex for each of the six listed actors
        for i in range(1, 7):
            for suffix in ("", "_rank", "_sex"):
                key = "actor%d%s" % (i, suffix)
                info[key] = entry[key]
        movies[entry["title"]] = info
Movie scripts contain many terms that relate to the craft of script writing rather than to the story itself, and we want to remove those.
# List of terms associated with movie script writing
movie_scripts_terms = ['written', 'writer', 'int', 'ext', 'day', 'night', 'morning',
'evening', 'fade', 'cut', 'continued', 'cont', 'contd', 'continuing',
'toward', 'towards', 'overlapping', 'screentalk', 'screen',
'talk', 'offscreen', 'pan', 'pans', 'tilt', 'tilts', 'camera',
'movie', 'film', 'filming', 'gesture', 'gesturing']
First we create a function to tokenize the words we get from the movie scripts. Using regular expressions, the list this function returns contains no numbers and no punctuation.
It is also important to exclude words that start with a capital letter: character and place names are capitalized, and we do not want them in our text analysis. Stop words, terms associated with writing a movie script, and words shorter than three letters are excluded as well, to give a better picture of the words the movie scripts are made of.
import re
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# Build the stop-word set once; set membership tests are much faster
# than looking words up in a list on every iteration
stop_words = set(stopwords.words('english'))

# Function for getting tokens from text
def tokens(text):
    # Remove numbers (including decimals)
    no_numbers = re.sub(r'\d+(\.\d+)?', '', text)
    # Remove punctuation and tokenize
    tokenizer = RegexpTokenizer(r'\w+')
    tokens_without_punct = tokenizer.tokenize(no_numbers)
    # Remove words that start with a capital letter (character and place names)
    no_starting_capital_letters = [word for word in tokens_without_punct
                                   if not word[0].isupper()]
    # Remove stop words
    no_stop_words = [word for word in no_starting_capital_letters if word not in stop_words]
    # Remove terms associated with movie script writing
    no_movie_script_terms = [word for word in no_stop_words if word not in movie_scripts_terms]
    # Remove words shorter than three characters
    filtered_words = [word for word in no_movie_script_terms if len(word) > 2]
    return filtered_words
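As a quick sanity check, the same filtering steps can be sketched on a toy string. The stop-word list below is a tiny hard-coded stand-in for NLTK's English list, so this sketch runs without downloading any corpora:

```python
import re

# Tiny stand-ins for the real stop-word and script-term lists, just for illustration
toy_stopwords = {'the', 'and', 'a', 'of', 'in', 'is'}
toy_script_terms = {'int', 'ext', 'day', 'night', 'fade', 'cut'}

def toy_tokens(text):
    # Remove numbers (including decimals)
    text = re.sub(r'\d+(\.\d+)?', '', text)
    # \w+ tokenizes while dropping punctuation
    words = re.findall(r'\w+', text)
    # Keep lowercase-initial words of 3+ letters that are neither
    # stop words nor script-writing terms
    return [w for w in words
            if not w[0].isupper()
            and w not in toy_stopwords
            and w not in toy_script_terms
            and len(w) > 2]

sample = "INT. HOUSE - NIGHT. John walks in, the rain falling at 3.5 mph."
print(toy_tokens(sample))  # → ['walks', 'rain', 'falling', 'mph']
```

Scene headings, the character name and the number all disappear, leaving only the story-carrying words.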
The files containing the movie scripts are in .txt format, and we read them in decoded as UTF-8.
import os
import io

# Read in movie scripts and add words to a dictionary of tokens
path = './scripts/'
scripts_tokens = {}
for filename in os.listdir(path):
    if "." in filename[-5:]:
        # Replace '.' when it appears after the year in filenames with '/'
        # to match the IMDb dataset
        movie_name = filename.replace(".", "/")
    else:
        movie_name = filename
    with io.open(path + filename, 'r', encoding='utf8') as f:
        scripts_tokens[movie_name] = tokens(f.read())
We want to see whether how positive or negative a movie is, based on its script, relates to its genre.
To measure how positive or negative a movie script is, the LabMT wordlist was downloaded. It is available as supplementary material from Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter (Data Set S1).
The list was generated from tweets collected over a three-year period running from September 9, 2008 to September 18, 2011; the underlying data set comprises roughly 46.076 billion words contained in 4.586 billion tweets posted by over 63 million individual users.
For the evaluations, users on Mechanical Turk were asked to rate how a given word made them feel on a nine-point integer scale, with 50 independent evaluations obtained per word.
The resulting data set consists of 10,222 words and their average happiness evaluations according to the Mechanical Turk users.
import csv

# Go through Data_Set_S1.txt and map each word to its average happiness score
word_sentiment_dic = {}
with open('./Data_Set_S1.txt', 'r') as f:
    # Skip the header lines
    lines_after_3 = f.readlines()[3:]
    reader = csv.reader(lines_after_3, delimiter='\t')
    for row in reader:
        word = row[0]
        happiness_avg = row[2]
        word_sentiment_dic[word] = happiness_avg
A function to determine how positive or negative each word is (its sentiment score).
def get_sentiment_values(tokens):
    tokens_happiness_avg = []
    for word in tokens:
        # Add the score if the word is in the wordlist, otherwise skip it
        if word in word_sentiment_dic:
            tokens_happiness_avg.append(float(word_sentiment_dic[word]))
    return tokens_happiness_avg
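To illustrate the lookup, here is the same logic run on a few made-up LabMT-style scores (these numbers are invented for the example, not taken from the real wordlist):

```python
# Hypothetical happiness scores, invented for this example (not real LabMT values)
toy_sentiment_dic = {'love': 8.4, 'rain': 5.1, 'war': 1.8}

def toy_sentiment_values(tokens):
    scores = []
    for word in tokens:
        # Add the score if the word is in the wordlist, otherwise skip it
        if word in toy_sentiment_dic:
            scores.append(float(toy_sentiment_dic[word]))
    return scores

print(toy_sentiment_values(['love', 'war', 'xyzzy']))  # → [8.4, 1.8]
```

Words missing from the wordlist ('xyzzy' here) are simply dropped, so they do not pull the average toward any particular value.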
We create a dictionary keyed by every movie that has a movie script; each value is the list of sentiment scores for the words in that script.
import numpy as np

# Get sentiment value for each token from the scripts
movie_scripts_sentiment = {}
for movie in scripts_tokens:
    movie_scripts_sentiment[movie] = get_sentiment_values(scripts_tokens[movie])
The average sentiment score is then calculated for every movie script.
# Get average sentiment value for each script
movie_scripts_sentiment_avg = {}
for movie in movie_scripts_sentiment:
    movie_scripts_sentiment_avg[movie] = np.average(movie_scripts_sentiment[movie])
Now that we have an average sentiment score for every movie, we can print the top 10 positive and negative movie scripts. A higher sentiment score means the script contains more positive words than negative ones on average.
import operator

# Top-10 negative movie scripts based on sentiment analysis
movie_scripts_sentiment_avg_sorted = sorted(movie_scripts_sentiment_avg.items(), key=operator.itemgetter(1))
print("Top-10 negative movie scripts based on sentiment analysis")
for movie, score in movie_scripts_sentiment_avg_sorted[:10]:
    print(movie + ": " + str(score))

# Top-10 positive movie scripts based on sentiment analysis
movie_scripts_sentiment_avg_sorted = sorted(movie_scripts_sentiment_avg.items(), key=operator.itemgetter(1), reverse=True)
print("Top-10 positive movie scripts based on sentiment analysis")
for movie, score in movie_scripts_sentiment_avg_sorted[:10]:
    print(movie + ": " + str(score))
These results were expected: the movies topping the negative list have darker themes than those topping the positive list, which have lighter ones.
We also want to find out how the directors rank based on their movies' sentiment scores. To make the list more reliable, we exclude directors with fewer than three movies in the movie script dataset.
from collections import defaultdict

# Add sentiment scores to each director
movie_scripts_directors_sentiments = defaultdict(list)
for movie in scripts_tokens:
    movie_scripts_directors_sentiments[movies[movie]["director"]].append(movie_scripts_sentiment_avg[movie])

# Calculate the average sentiment score for each director
movie_scripts_directors_sentiment_avg = {}
for director, sentiment_scores in movie_scripts_directors_sentiments.items():
    # Exclude directors with fewer than three movies
    if len(sentiment_scores) > 2:
        movie_scripts_directors_sentiment_avg[director] = np.average(sentiment_scores)
# Top-10 negative directors based on sentiment analysis
movie_scripts_directors_sentiment_avg_sorted = sorted(movie_scripts_directors_sentiment_avg.items(), key=operator.itemgetter(1))
print("Top-10 negative directors based on sentiment analysis")
for director, score in movie_scripts_directors_sentiment_avg_sorted[:10]:
    print(director + ": " + str(score))

# Top-10 positive directors based on sentiment analysis
movie_scripts_directors_sentiment_avg_sorted = sorted(movie_scripts_directors_sentiment_avg.items(), key=operator.itemgetter(1), reverse=True)
print("Top-10 positive directors based on sentiment analysis")
for director, score in movie_scripts_directors_sentiment_avg_sorted[:10]:
    print(director + ": " + str(score))
Here the results are not as decisive as for the movies, owing to the limitations of the dataset. Still, we can see that directors of comedies and more upbeat movies rank at the top of the positive list (Cameron Crowe, Woody Allen and James L. Brooks), while directors associated with action movies rank high on the negative list (Alex Proyas, Guillermo del Toro, Michael Bay and Sam Raimi).
Next we look at how the actors rank. As with the directors, we exclude actors who have appeared in fewer than three movies.
actors_list = ["actor1", "actor2", "actor3", "actor4", "actor5", "actor6"]

# Add sentiment scores to each actor
movie_scripts_actors_sentiments = defaultdict(list)
for movie in scripts_tokens:
    for actor in actors_list:
        movie_scripts_actors_sentiments[movies[movie][actor]].append(movie_scripts_sentiment_avg[movie])

# Calculate the average sentiment score for each actor
movie_scripts_actors_sentiment_avg = {}
for actor, sentiment_scores in movie_scripts_actors_sentiments.items():
    # Exclude actors who have appeared in fewer than three movies
    if len(sentiment_scores) > 2:
        movie_scripts_actors_sentiment_avg[actor] = np.average(sentiment_scores)
# Top-10 negative actors based on sentiment analysis
movie_scripts_actors_sentiment_avg_sorted = sorted(movie_scripts_actors_sentiment_avg.items(), key=operator.itemgetter(1))
print("Top-10 negative actors based on sentiment analysis")
for actor, score in movie_scripts_actors_sentiment_avg_sorted[:10]:
    print(actor + ": " + str(score))

# Top-10 positive actors based on sentiment analysis
movie_scripts_actors_sentiment_avg_sorted = sorted(movie_scripts_actors_sentiment_avg.items(), key=operator.itemgetter(1), reverse=True)
print("Top-10 positive actors based on sentiment analysis")
for actor, score in movie_scripts_actors_sentiment_avg_sorted[:10]:
    print(actor + ": " + str(score))
The results here are a little more random; we would probably need many more movie scripts to rank the actors accurately. Forest Whitaker, an Oscar winner who mainly stars in drama and action movies, tops the negative list. Courteney Cox, mainly associated with comedies, is somehow in the negative top 10, the odd one out. Comedy actors like Joan Cusack, Michael Sheen, Paul Rudd and Zach Galifianakis top the positive list, while drama and action actors top the negative list. So the list is somewhat accurate, and can give us clues about which kinds of movies actors are likely to star in.
Next up are the genres. It seems obvious that comedies would be more positive than horror movies, but do the data from 641 movie scripts confirm that assumption? To give more reliable results, genres with fewer than 10 movies are excluded.
# Get the list of movies for each genre
movie_scripts_genres = defaultdict(list)
for movie in scripts_tokens:
    movie_scripts_genres[movies[movie]["genre"]].append(movie)

# Add sentiment scores to each genre
movie_scripts_genres_sentiments = defaultdict(list)
for movie in scripts_tokens:
    movie_scripts_genres_sentiments[movies[movie]["genre"]].append(movie_scripts_sentiment_avg[movie])

# Calculate the average sentiment score for each genre
movie_scripts_genres_sentiment_avg = {}
for genre, sentiment_scores in movie_scripts_genres_sentiments.items():
    # Exclude genres that have fewer than 10 movies
    if len(sentiment_scores) > 9:
        movie_scripts_genres_sentiment_avg[genre] = np.average(sentiment_scores)
# Sentiment scores for genres
movie_scripts_genres_sentiment_avg_sorted = sorted(movie_scripts_genres_sentiment_avg.items(), key=operator.itemgetter(1))
print("Sentiment scores for genres, from negative to positive")
for genre, score in movie_scripts_genres_sentiment_avg_sorted:
    print(genre + ": " + str(score))
These results confirm the assumption: comedy is the most positive genre, and it should surprise no one that action, horror and crime rank as more negative than the other genres.
from collections import Counter

# Count the movie scripts in each genre
genre_num_of_scripts = {}
for genre in movie_scripts_genres:
    genre_num_of_scripts[genre] = len(movie_scripts_genres[genre])

# Sort by number of movie scripts
genre_num_of_scripts_sorted = sorted(genre_num_of_scripts.items(), key=operator.itemgetter(1), reverse=True)
print("Number of movie scripts in each genre")
for genre, count in genre_num_of_scripts_sorted:
    print(genre + ": " + str(count))
We want to see whether some words are used more than others within a specific genre, and whether any themes are visible within the genres.
To do that we use tf-idf, which weighs how frequently a word occurs against how many documents it appears in, to determine each word's importance within a specific genre.
We start by determining the term frequency (tf) of each word within its genre.
# Count how often a token appears in a specific genre (tf values)
def tokens_tf_genre(tokens_genre):
    movie_scripts_genre_tokens = []
    for movie in tokens_genre:
        movie_scripts_genre_tokens += scripts_tokens[movie]
    tokens_counter = Counter(movie_scripts_genre_tokens)
    # Sort the (token, count) pairs alphabetically by token
    tokens_tf_sorted = sorted(tokens_counter.items())
    return tokens_tf_sorted
Next we calculate the inverse document frequency (idf), which measures how distinctive each word is.
import math

# Calculate idf values for each word in a genre
def tokens_idf_genre(tokens_genre):
    # Get every word in the corpus (all movie scripts in the genre)
    total_word_list = []
    for movie in tokens_genre:
        # Count each token at most once per movie script
        total_word_list += set(scripts_tokens[movie])
    # How many movie scripts each token appears in (document frequency)
    total_word_list_count = Counter(total_word_list)
    total_word_list_sorted = sorted(set(total_word_list))
    idf_values = []
    for token in total_word_list_sorted:
        # log(number of movie scripts / number of scripts containing the token)
        idf_value = math.log(len(tokens_genre) / float(total_word_list_count[token]))
        idf_values.append(idf_value)
    # Combine tokens and idf values into (token, idf) pairs
    tokens_idf = list(zip(total_word_list_sorted, idf_values))
    return tokens_idf
Tf-idf is calculated by multiplying the values for tf and idf together.
# Calculate tf-idf values for each word in a genre
def tokens_tfidf_genre(tokens_tf, tokens_idf):
    # Unzip the (token, value) pairs; both lists are sorted by token,
    # so the tf and idf values line up
    tokens_tf, values_tf = zip(*tokens_tf)
    tokens_idf, values_idf = zip(*tokens_idf)
    # Multiply tf and idf values
    tokens_tfidf_values = [a * b for a, b in zip(values_tf, values_idf)]
    # Combine tokens and tf-idf values into (token, tf-idf) pairs
    tokens_tfidf = list(zip(tokens_tf, tokens_tfidf_values))
    return tokens_tfidf
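To see the three steps working together, here is a condensed version of the same computation on a toy corpus of three tiny tokenized "scripts" (made-up data). A token that appears in every script gets idf = log(3/3) = 0, so its tf-idf is zero no matter how often it occurs:

```python
import math
from collections import Counter

# Toy corpus: three tokenized "movie scripts" (made-up data)
toy_scripts = {
    'movie_a': ['gun', 'gun', 'vault', 'guard'],
    'movie_b': ['gun', 'casino', 'vault'],
    'movie_c': ['gun', 'sword'],
}

# tf: raw counts across the whole genre
tf = Counter(token for tokens in toy_scripts.values() for token in tokens)

# idf: log(number of scripts / number of scripts containing the token)
n_docs = len(toy_scripts)
doc_freq = Counter(token for tokens in toy_scripts.values() for token in set(tokens))
idf = {token: math.log(n_docs / doc_freq[token]) for token in doc_freq}

# tf-idf: multiply the two values for each token
tfidf = {token: tf[token] * idf[token] for token in tf}
print(tfidf['gun'])    # → 0.0: 'gun' appears in all 3 scripts, so idf = log(1) = 0
print(tfidf['vault'])  # 2 occurrences in 2 of 3 scripts, so 2 * log(3/2) ≈ 0.81
```

This is why the word clouds below highlight genre-specific vocabulary: words common to every script in a genre are weighted down to zero.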
The WordCloud image takes a string of words as an argument and draws an image in which each word's size reflects how often it appears in the string.
Analysis of the WordCloud images for each genre follows the last image.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

# Function for creating a WordCloud image
def wordcloud_image(tfidf):
    wordcloud_text = ""
    for token, value in tfidf:
        # Repeat each token in proportion to its (rounded) tf-idf value
        num_of_repeats = int(round(value))
        wordcloud_text += num_of_repeats * (token + " ")
    wordcloud = WordCloud(width=1600, height=800).generate(wordcloud_text)
    plt.figure(figsize=(20, 10), facecolor='k')
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()

# Creates a WordCloud image for a genre using tf-idf
def wordcloud_tfidf(name_of_genre):
    genre_tf = tokens_tf_genre(movie_scripts_genres[name_of_genre])
    genre_idf = tokens_idf_genre(movie_scripts_genres[name_of_genre])
    genre_tfidf = tokens_tfidf_genre(genre_tf, genre_idf)
    print("Number of movie scripts in genre: " + str(len(movie_scripts_genres[name_of_genre])))
    print("Movie scripts represented in the image")
    print(str(movie_scripts_genres[name_of_genre]))
    wordcloud_image(genre_tfidf)
# Wordcloud image for Action genre
wordcloud_tfidf("Action")
# Wordcloud image for Comedy genre
wordcloud_tfidf("Comedy")
# Wordcloud image for Drama genre
wordcloud_tfidf("Drama")
# Wordcloud image for Crime genre
wordcloud_tfidf("Crime")
# Wordcloud image for Adventure genre
wordcloud_tfidf("Adventure")
# Wordcloud image for Biography genre
wordcloud_tfidf("Biography")
# Wordcloud image for Horror genre
wordcloud_tfidf("Horror")
# Wordcloud image for Mystery genre
wordcloud_tfidf("Mystery")
The best representation of a genre is clearly the crime genre, which is much more specific than the other genres displayed here. Words like gun, cell, vault, guard, casino, shotgun and profanities (fuck) are common in the image, which really reflects the themes of crime movies.
The image for the action genre includes many words related to weapons, sword being the most obvious, along with words related to transport and creatures (dragon, alien and vampire). All in all, a fairly good representation of the action genre.
The comedy genre has the problem of being very broad; many movies can fall into that category. The word nun is somehow very popular within the comedy movies in the dataset, and profanities have a strong presence as well.
The same can be said of the drama genre: movies considered drama cover many different themes and subjects, so mostly words related to transport stand out.
The other genres have considerably fewer words to play with, which is reflected in their wordclouds; individual movies within those genres carry much more weight in the images.